The Lancet Digital Health

Elsevier BV

Preprints posted in the last 7 days, ranked by how well they match The Lancet Digital Health's content profile, based on 25 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
Harmonising UK primary care prescription records for research: A case study in the UK Biobank

Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.

2026-04-22 health informatics 10.64898/2026.04.21.26351274 medRxiv
Top 0.1%
6.3%

Objective The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped due to a missing drug code. 7,357,572 records (13%) used multiple ontologies. Most (76%) records were encoded in BNF and most had the code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, most of which resembled drug-indication pairs.
Conclusion Our methodology converts highly fragmented and raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Harmonised prescription records can be readily used by researchers for large-scale analyses.
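The harmonisation pipeline this abstract describes reduces, in essence, to a chain of vocabulary lookups: source code to dm+d, then dm+d to a single VMP-level concept. A minimal sketch, with entirely invented mapping-table contents (the real BNF/Read2-to-dm+d maps are large NHS reference files):

```python
# Hypothetical sketch of the harmonisation idea: map source drug codes
# (BNF or Read2) to a unified reference vocabulary (dm+d), then roll
# every record up to one concept granularity (VMP). All table contents
# below are invented for illustration.

BNF_TO_DMD = {"0212000B0AAABAB": "dmd:318420"}    # BNF code -> dm+d (invented)
READ2_TO_DMD = {"bxd1.": "dmd:318420"}            # Read2 code -> dm+d (invented)
DMD_TO_VMP = {"dmd:318420": "VMP:simvastatin 40mg tablets"}  # roll-up (invented)

def harmonise(record):
    """Return the record's VMP-level concept, or None if unmappable."""
    code, ontology = record["code"], record["ontology"]
    lookup = {"BNF": BNF_TO_DMD, "Read2": READ2_TO_DMD}.get(ontology, {})
    dmd = lookup.get(code)
    return DMD_TO_VMP.get(dmd) if dmd else None

# Records from different ontologies land on the same VMP concept.
print(harmonise({"code": "bxd1.", "ontology": "Read2"}))
```

Unmappable codes fall through as None, mirroring the 28% of records the study could not harmonise to dm+d.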

2
Prognosis of stroke subtypes in whole population health systems data: a matched cohort study

Hosking, A.; Iveson, M. H.; Sherlock, L.; Mukherjee, M.; Grover, C.; Alex, B.; Parepalli, S.; Mair, G.; Doubal, F.; Whalley, H. C.; Tobin, R.; Wardlaw, J. M.; Al-Shahi Salman, R.; Whiteley, W. N.

2026-04-20 neurology 10.64898/2026.04.17.26351150 medRxiv
Top 0.1%
6.1%

Background Outcome after stroke varies according to stroke subtype by location, but healthcare systems data studies do not include subtyping information. We linked natural language processing (NLP) of brain imaging reports to routinely collected data to estimate risk of death and other outcomes after stroke subtypes in a nationwide dataset. Methods We applied a previously validated NLP algorithm to all CT and MRI head scan reports in Scotland between 2010 and 2018. We linked the reports to hospital readmissions, prescriptions and death data to identify and characterize people with stroke, and to categorize into deep and cortical ischemic stroke, deep and lobar intracerebral hemorrhage (ICH), subarachnoid hemorrhage, and subdural hemorrhage. We used a matched cohort design, and age- and sex-matched four controls per case who never had a stroke. By subtype, we estimated rehospitalization with stroke, myocardial infarction (MI), cancer, dementia, epilepsy and death, accounting for confounders and competing risk of death. Results From 785,331 people with a head scan, we identified 64,219 with clinical stroke phenotypes (mean age 73.4 years, 49.5% male), and subtyped 12,616 with deep ischemic stroke; 14,103 with cortical ischemic stroke; 1,814 with deep ICH; and 1,456 with lobar ICH. There was a higher absolute rate of 1-year hospital readmission for lobar compared with deep ICH (4.9% [95%CI 3.9% - 6.1%] vs 3.4% [2.6% - 4.3%]), higher risk of dementia beyond 6 months after lobar ICH compared to controls than for other stroke subtypes (aHR 3.5 [2.3-5.3]); and higher risk of MI within 6 months of cortical ischemic stroke than for other stroke subtypes (aHR 4.6 [3.4-6.3]). Conclusions NLP of free-text reports linked to coded data successfully subtyped stroke at scale, and we estimated risk of clinically relevant outcomes. Future work should use free text to enable large-scale audit and epidemiology of people with stroke.

3
The FEES Dysphagia Index: a bias-resilient continuous score that captures expert clinical judgment in 2,943 neurological inpatients

Werner, C. J.; Sanchez-Garcia, E.; Mall, B.; Meyer, T.; Pinho, J.; Schulz, J. B.; Schumann-Werner, B.

2026-04-21 neurology 10.64898/2026.04.20.26351259 medRxiv
Top 0.1%
4.8%

Multi-consistency testing during flexible endoscopic evaluation of swallowing (FEES) is clinically necessary but introduces selection bias: worst scores inflate severity because the number of consistencies tested covaries with disease severity. In this retrospective observational study of hospitalized neurological patients, we derived and validated the FEES Dysphagia Index (FDI) in two temporally independent cohorts (Cohort 1: 2013-2018, N=1,257; Cohort 2: 2021-2025, N=1,686) from a single center. FDI-S averages Penetration-Aspiration Scale (PAS) scores across tested consistencies (0-100 scale); FDI-E uses Yale Pharyngeal Residue scores; FDI-C combines both. Selection bias was quantified using sequential branching-tree inverse probability weighting (IPW). Worst PAS overestimated severity by 24%; FDI deviated by <2%. FDI-C was significantly superior to Worst PAS for hospital-acquired pneumonia (HAP; AUC 0.70 vs. 0.60, p<0.001), mortality (0.71 vs. 0.62, p=0.040), and restricted oral intake (0.90 vs. 0.74, p<0.001), and statistically equivalent to clinician-rated severity. FDI-C mapped linearly onto ordinal Functional Oral Intake Scale values (FOIS; proportional odds RCS p=0.99). With functional status and diagnosis, FDI-C reconstructed the clinicians' oral intake recommendation with AUC up to 0.93. The FDI-C-mortality relationship was sigmoidal with a clinically relevant transition zone between ~50 and ~85. FDI-C is a bias-resilient, bedside-calculable score with interval-scale properties that captures expert clinical judgment, suitable as both a clinical decision support tool and a continuous research endpoint.
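The core FDI-S idea (replacing the worst PAS score with the mean across all tested consistencies, rescaled to 0-100) can be illustrated in a few lines. The linear rescaling shown here is our assumption for illustration, not necessarily the paper's exact formula:

```python
def fdi_s(pas_scores):
    """Average PAS across tested consistencies, rescaled to 0-100.
    PAS runs from 1 (best) to 8 (worst); the linear (x - 1) / 7 * 100
    rescaling is an assumption, not the published definition."""
    mean_pas = sum(pas_scores) / len(pas_scores)
    return (mean_pas - 1) / 7 * 100

# Averaging counters worst-score inflation: one PAS 8 among otherwise
# safe swallows no longer dominates the severity estimate.
print(round(fdi_s([1, 1, 8]), 1))   # a worst-PAS score would report 8/8
```

Because the score averages rather than takes the maximum, testing more consistencies does not mechanically inflate severity, which is the bias-resilience property the abstract claims.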

4
Data Resource Profile: EST-Health-30

Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.

2026-04-24 epidemiology 10.64898/2026.04.21.26351087 medRxiv
Top 0.1%
3.7%

Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia. Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks. Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course. Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research. 
Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations.
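A salted hash is one generic way to achieve the consistent, pseudonymised cohort inclusion described above: the same person deterministically falls in or out of the sample at every annual refresh. The scheme below is an illustration of the idea only; the salt, digest, and bucketing are our assumptions, not EST-Health-30's actual method:

```python
import hashlib

def in_sample(person_id: str, salt: str = "demo-salt", pct: int = 30) -> bool:
    """Deterministic sampling: hash a salted identifier and keep the
    individual if the hash falls in the first `pct` of 100 buckets.
    Illustrative only; not the resource's documented procedure."""
    digest = hashlib.sha256((salt + person_id).encode()).hexdigest()
    return int(digest, 16) % 100 < pct

# The same id always yields the same decision across refreshes,
# and roughly 30% of a large population is selected.
cohort = [pid for pid in ("id-001", "id-002", "id-003") if in_sample(pid)]
```

Keeping the salt secret preserves pseudonymisation: sample membership cannot be recomputed from the identifier alone.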

5
Most Instability Phases Resolve: Empirical Evidence for Trajectory Plasticity in Multimorbidity Care from Longitudinal Relational Monitoring

Martin, C. M.; Henderson, I.; Campbell, D.; Stockman, K.

2026-04-24 health informatics 10.64898/2026.04.22.26351537 medRxiv
Top 0.1%
3.6%

Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (>=2 consecutive calls with Total_Alerts >=3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, +/-10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, +/-14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5. 
In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE beta = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE beta = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded AUC 0.61. Conclusions: Two-thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence, in duration and in the consistency of high-severity multi-domain flagging across calls, distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient illness prognosis, carry the pre-admission signal.
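The window-detection rule stated in the Methods (at least two consecutive calls with Total_Alerts >= 3) is straightforward to operationalise over a patient's call series; a sketch, with the function shape and naming ours:

```python
def instability_windows(total_alerts, threshold=3, min_len=2):
    """Return (start, end) index pairs (end exclusive) for runs of at
    least `min_len` consecutive calls whose Total_Alerts value meets
    `threshold`. Mirrors the rule as stated in the abstract."""
    windows, start = [], None
    for i, alerts in enumerate(total_alerts):
        if alerts >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                windows.append((start, i))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(total_alerts) - start >= min_len:
        windows.append((start, len(total_alerts)))
    return windows

print(instability_windows([1, 4, 5, 2, 3, 3, 3, 0]))  # [(1, 3), (4, 7)]
```

Each detected window would then be classified as "resolved" or "admitted" by linkage to ED and hospital admission data, as the study does.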

6
Temporal features of the built environment and associations with drowning mortality: A global satellite-based analysis

Essex, R.; Lim, S.; Jagnoor, J.

2026-04-21 public and global health 10.64898/2026.04.19.26351237 medRxiv
Top 0.2%
3.5%

Background Drowning remains a major global public health challenge. This study examined whether the timing and trajectories of urbanisation, beyond the current built environment, are associated with subnational drowning mortality. Methods We linked satellite-derived measures of built-environment change (GHSL), population crowding (WorldPop), surface water exposure (JRC Global Surface Water), and infrastructure proxies (VIIRS/DMSP nighttime lights) to GBD 2021 drowning mortality estimates across 203 ADM1 regions in 12 countries (2006-2021; 3,248 region-year observations). Temporal predictors captured recent expansion, development "newness" (≤10-year built share), acceleration/volatility, and a crowding × growth interaction. We screened predictors using LASSO (10-fold cross-validation) and fitted mixed-effects models with region random intercepts. Distributed-lag models tested temporal precedence and development age, and income-stratified models assessed heterogeneity. Results Adding temporal predictors improved fit beyond contemporaneous built-environment measures (ΔAIC=177; ΔBIC=147). In adjusted models, crowding × growth was strongly positively associated with drowning mortality, and a higher share of recent development was associated with higher mortality. Lag models showed a development age gradient: older built environment was most protective. Associations differed by income group, with several key coefficients reversing sign across strata. Discussion Drowning mortality appears shaped by development histories as well as present-day conditions, with risk concentrated in rapidly changing, dense settings and the newest built environments. Cross-context heterogeneity suggests mechanisms and prevention priorities are unlikely to be uniform. Conclusions Development timing and trajectories help explain subnational drowning mortality beyond current built form alone.
Prevention and planning should prioritise transition-period safety strategies in newly developing and rapidly densifying areas.
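The LASSO screening step works by shrinking weak coefficients to exactly zero, which is what drops uninformative temporal predictors. For an orthonormal design this reduces to soft-thresholding the OLS coefficients, which makes the variable-dropping mechanism easy to see (predictor names and coefficient values below are invented):

```python
import math

def soft_threshold(b, lam):
    """LASSO solution for an orthonormal design: shrink the OLS
    coefficient b toward zero by lam, and set it to exactly zero when
    |b| <= lam. This zeroing is how LASSO screening discards predictors."""
    return math.copysign(max(abs(b) - lam, 0.0), b)

# Invented coefficients for three temporal predictors; at lam = 0.10
# only the two stronger predictors survive screening.
ols = {"recent_expansion": 0.42, "newness": -0.31, "volatility": 0.05}
kept = {k: soft_threshold(b, 0.10) for k, b in ols.items()
        if soft_threshold(b, 0.10) != 0.0}
```

In the study the penalty lam is chosen by 10-fold cross-validation rather than fixed by hand, and the design is not orthonormal, so this is a mechanism sketch only.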

7
Design and preliminary safety validation of a hybrid deterministic-AI triage system for multilingual primary healthcare: a WhatsApp-based vignette study in South Africa

Nkosi-Mjadu, B. E.

2026-04-22 health informatics 10.64898/2026.04.21.26349781 medRxiv
Top 0.2%
3.2%

Background South Africa's public healthcare system serves most of the population through approximately 3,900 primary healthcare clinics characterised by long waiting times and high volumes of repeat-prescription visits. No published pre-arrival digital triage system operates across all 11 official South African languages while aligning with the South African Triage Scale (SATS). This paper reports the design and preliminary safety validation of BIZUSIZO, a hybrid deterministic-AI WhatsApp triage system. Methods BIZUSIZO delivers SATS-aligned triage via WhatsApp, combining AI-assisted free-text classification (Claude Haiku 4.5) with a Deterministic Clinical Safety Layer (DCSL) that overrides AI output for 53 clinical discriminator categories (14 RED, 19 ORANGE, 20 YELLOW) coded in all 11 official languages and independent of AI availability. A five-domain risk factor assessment can only upgrade the triage level. One hundred and twenty clinical vignettes in patients' languages (English, isiZulu, isiXhosa, Afrikaans; 30 per language) were scored against a developer-assigned gold standard with independent blinded nurse review. A 121-vignette multilingual DCSL safety consistency check across all 11 languages and a 220-call post-hoc framing sensitivity evaluation (110 paired vignettes) were also conducted. Results Under-triage was 3.3% (4/120; 95% CI: 0.9%-8.3%) with no RED under-triage; exact concordance was 80.0% (96/120) and quadratic weighted kappa 0.891 (95% CI: 0.827-0.932). One two-level under-triage was observed on a non-RED presentation (V072, isiXhosa burns vignette, ORANGE→GREEN); one two-level over-triage was observed (V054, isiZulu deep laceration, YELLOW→RED). In the framing sensitivity evaluation, AI-only classification achieved 50.9% RED invariance under adversarial framing; full-pipeline classification achieved 95.0% in four validated languages, with the DCSL rescuing 18 of 23 AI drift cases.
Conclusions A hybrid deterministic-AI triage system with DCSL-based emergency detection achieved zero RED under-triage and consistent RED detection across all 11 official languages. The 16.7% over-triage rate falls within published South African SATS ranges (13.1-49%). A single two-level under-triage event was observed on an isiXhosa burns vignette (ORANGE→GREEN) and is discussed in Limitations. Findings are preliminary; prospective validation against independent nurse triage is the necessary next step.
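The hybrid rule described (a deterministic discriminator match overrides the AI output, and the risk-factor assessment can only move the level up, never down) can be sketched as an ordering over SATS levels. The function shape is our illustration, not BIZUSIZO's implementation:

```python
LEVELS = ["GREEN", "YELLOW", "ORANGE", "RED"]  # SATS order, least to most urgent

def triage(ai_level, dcsl_level=None, risk_upgrade=None):
    """Combine the three layers as described in the abstract:
    - a DCSL discriminator match, when present, overrides the AI output;
    - the risk-factor assessment can only upgrade, never downgrade."""
    level = dcsl_level if dcsl_level is not None else ai_level
    if risk_upgrade is not None and LEVELS.index(risk_upgrade) > LEVELS.index(level):
        level = risk_upgrade
    return level

# The deterministic layer wins even if the AI under-triages.
print(triage("GREEN", dcsl_level="RED"))
```

Because the DCSL is rule-based and language-keyed, this path stays available even when the AI classifier is unavailable, which is the safety property the paper emphasises.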

8
A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350890 medRxiv
Top 0.2%
3.1%

Background Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model × Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost.
Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
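Macro-F1, the headline metric here, is the unweighted mean of per-phenotype F1 scores, so rare phenotypes count as much as common ones. A minimal sketch (the per-phenotype counts below are invented):

```python
def macro_f1(per_phenotype):
    """Macro-F1 from per-phenotype (tp, fp, fn) counts: compute F1 for
    each phenotype independently, then average without prevalence
    weighting, so hard or rare phenotypes are not masked by common ones."""
    f1s = []
    for tp, fp, fn in per_phenotype.values():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Invented counts: one easy phenotype (F1 = 0.9), one hard one (F1 = 0.4).
print(round(macro_f1({"diabetes": (90, 10, 10), "rare_dz": (4, 6, 6)}), 2))
```

This averaging is why the reported macro-F1 range (0.40-0.90) directly reflects the spread in per-phenotype difficulty.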

9
Addition of Bupropion or Varenicline to Nicotine Replacement Therapy After Acute Coronary Syndrome: A Propensity-Matched Real-World Analysis

Qadeer, A.; Gohar, N.; Maniyar, P.; Shafi, N.; Juarez, L. M.; Mortada, I.; Pack, Q. R.; Jneid, H.; Gaalema, D. E.

2026-04-23 cardiovascular medicine 10.64898/2026.04.21.26351432 medRxiv
Top 0.2%
2.6%

Introduction: Smoking cessation after acute coronary syndrome (ACS) is a Class I recommendation, yet prescription pharmacotherapy use remains low and its real-world cardiovascular effectiveness when added to nicotine replacement therapy (NRT) is poorly characterized. Methods: We conducted a retrospective cohort study using the TriNetX US Collaborative Network (67 healthcare organizations). Adults hospitalized with ACS who received NRT within one month, serving as a proxy for active smoking status, were identified. Two co-primary propensity-matched (1:1, 50 covariates, caliper 0.10 SD) comparisons evaluated bupropion + NRT and varenicline + NRT individually versus NRT alone; a supportive analysis evaluated combined pharmacotherapy versus NRT alone. All-cause mortality was the primary endpoint. Secondary outcomes included MACE, heart failure exacerbations, major bleeding, TIA/stroke, emergency rehospitalizations, and cardiac rehabilitation utilization, assessed at 6 months and 1 year via Kaplan-Meier analysis. Hazard ratios (HRs) greater than 1.0 indicate higher hazard in the NRT-only group. Results: After matching, the combined analysis comprised 8,574 pairs, the bupropion analysis 4,654 pairs, and the varenicline analysis 2,126 pairs. At 1 year, the combined pharmacotherapy group had significantly lower all-cause mortality (HR 1.26, 95% CI 1.16-1.37), MACE (HR 1.16, 95% CI 1.12-1.21), heart failure exacerbations (HR 1.16, 95% CI 1.08-1.25), major bleeding (HR 1.18, 95% CI 1.08-1.28), and greater cardiac rehabilitation utilization (HR 0.82, 95% CI 0.74-0.92; all p < 0.001). TIA/stroke did not differ significantly. Six-month results were consistent. Both varenicline and bupropion individually showed lower mortality and MACE. A urinary tract infection falsification endpoint showed no between-group differences, supporting matching validity. The pharmacotherapy group had higher rates of new-onset depression, driven predominantly by bupropion recipients. 
Conclusions: In this propensity-matched real-world analysis, adding prescription smoking cessation pharmacotherapy to NRT after ACS was associated with lower mortality and fewer adverse cardiovascular events, supporting broader integration into post-ACS care pathways.
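Caliper-based 1:1 propensity matching of the kind used here can be illustrated with a greedy nearest-neighbour sketch. The study's actual matching uses 50 covariates and a caliper of 0.10 SD of the propensity score; the IDs, scores, and caliper below are invented:

```python
def caliper_match(treated, controls, caliper):
    """Greedy 1:1 matching: each treated unit takes the nearest unmatched
    control whose propensity score lies within the caliper. Treated units
    with no eligible control go unmatched. Illustration only."""
    pairs, used = [], set()
    for t_id, t_ps in treated:
        best = min(
            ((c_id, abs(t_ps - c_ps)) for c_id, c_ps in controls
             if c_id not in used),
            key=lambda x: x[1], default=None)
        if best and best[1] <= caliper:
            pairs.append((t_id, best[0]))
            used.add(best[0])
    return pairs

print(caliper_match([("t1", 0.30), ("t2", 0.80)],
                    [("c1", 0.32), ("c2", 0.50), ("c3", 0.79)],
                    caliper=0.05))  # pairs t1-c1 and t2-c3; c2 unused
```

The caliper is what enforces comparability: a treated patient with no sufficiently similar control is dropped rather than matched badly.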

10
CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data

Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.

2026-04-23 health informatics 10.64898/2026.04.22.26351461 medRxiv
Top 0.2%
2.4%

Abstract Objective To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.
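The enrichment step at the heart of CohortContrast compares concept prevalence between a target cohort and a control cohort. A simplified stand-in for that idea (the package's real statistics, noise filters, and aggregation are richer; the function and parameter names here are ours, not the package's API):

```python
def enriched_concepts(target_prev, control_prev, min_ratio=2.0, floor=1e-4):
    """Keep concepts whose prevalence in the target cohort is at least
    min_ratio times that in the control cohort; `floor` guards against
    division by zero for concepts absent from controls. Simplified
    stand-in for the package's enrichment step."""
    out = {}
    for concept, p_t in target_prev.items():
        ratio = p_t / max(control_prev.get(concept, 0.0), floor)
        if ratio >= min_ratio:
            out[concept] = round(ratio, 1)
    # Most enriched first.
    return dict(sorted(out.items(), key=lambda kv: -kv[1]))

# Invented prevalences: chemotherapy is cohort-specific, flu vaccination
# is background noise common to both cohorts and gets filtered out.
print(enriched_concepts({"chemotherapy": 0.40, "flu_vaccine": 0.10},
                        {"chemotherapy": 0.01, "flu_vaccine": 0.09}))
```

Filtering on enrichment rather than raw prevalence is what removes common background concepts, the main source of the 94.9-97.0% dimensionality reduction reported.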

11
Accessible and Reproducible Renal Cell Carcinoma Research Through Open-Sourcing Data and Annotations

de Boer, S.; Häntze, H.; Ziegelmayer, S.; van Ginneken, B.; Prokop, M.; Bressem, K. K.; Hering, A.

2026-04-23 radiology and imaging 10.64898/2026.04.22.26351451 medRxiv
Top 0.2%
2.1%

Background: Medical imaging, especially computed tomography and magnetic resonance imaging, is essential in the clinical care of patients with renal cell carcinoma (RCC). Artificial intelligence (AI) research into computer-aided diagnosis, staging and treatment planning needs curated and annotated datasets. Across the literature, The Cancer Genome Atlas (TCGA) datasets are widely used for model training and validation. However, re-annotation is often necessary due to limited access to public annotations, raising entry barriers and hindering comparison with prior work. Methods: We screened 1915 CT scans from three TCGA-RCC databases and employed a segmentation model to annotate kidney lesions. After a metadata-based exclusion step, we conducted a reader study with all papillary (n=56), chromophobe (n=27) and 200 randomly selected clear cell RCC cases. Two students quality-checked and corrected the data as well as annotated tumors and cysts. Uncertain cases were checked by a board-certified radiologist. Results: After data exclusion and quality control a total of 142 annotated CT scans from 101 patients (26 female, 75 male, mean age 56 years) remained. This includes 95 CTs with clear cell RCC, 29 with papillary RCC and 18 with chromophobe RCC. Images and voxel-level annotations of kidneys and lesions are open sourced at https://zenodo.org/records/19630298. Conclusion: By making the annotations open-source, we encourage accessible and reproducible AI research for renal cell carcinoma. We invite other researchers who have previously annotated any of these cohorts to share their annotations.

12
Where risk becomes visible: a layered fixed-policy framework for diabetic kidney disease screening in type 2 diabetes

Khattab, A.; Wang, Z.; Srinivasasainagendra, V.; Tiwari, H. K.; Loos, R.; Limdi, N.; Irvin, M. R.

2026-04-22 nephrology 10.64898/2026.04.21.26351384 medRxiv
Top 0.3%
1.8%

Background Diabetic kidney disease (DKD) is a leading cause of kidney failure in individuals with type 2 diabetes (T2D), yet risk identification in routine clinical practice remains incomplete. A critical and often overlooked barrier is risk observability: how much of a patient's underlying risk is actually captured in their clinical record at the time of screening. Existing prediction models evaluate performance using model-specific thresholds, making it difficult to understand how additional data sources alter real-world screening behavior or which individuals benefit when models are expanded. Methods We developed a series of five nested machine learning models evaluated at a one-year landmark following T2D diagnosis using data from the All of Us Research Program (N = 39,431; cases = 16,193). Each successive model added a distinct information layer, comprising intrinsic risk, laboratory snapshots, medication exposure, longitudinal care trajectories, or social determinants of health (SDOH), while retaining all prior features. All models were evaluated under a fixed screening policy targeting 90% specificity, so that the false positive rate remained constant as the information available to the model grew. External validation was conducted in the BioMe Biobank (N = 9,818) without retraining. Results Discrimination improved consistently across layers, from AUROC 0.673 (M1) to 0.797 (M5). Under the fixed screening policy, sensitivity nearly doubled from 0.27 to 0.49, with a cumulative recovery of 30.4% of cases missed by the base model.
Gains were driven by distinct subgroups at each transition: laboratory features identified biologically high-risk individuals; medication features captured those with high treatment intensity reflecting advanced cardiometabolic burden; longitudinal care trajectory features rescued cases with biological instability observable only through repeated measurements; and SDOH features recovered individuals with limited clinical observability, with rescue probability highest among those with the fewest recorded monitoring domains. Sparse data in the clinical record indicated low observability, not low risk. Social and genetic features each contributed most when downstream physiologic signal was limited, supporting a contextual rather than universal role for each. In BioMe, discrimination was attenuated (M4 AUROC 0.659), but the relative ordering of information layers was fully preserved, and a systematic upward shift in predicted probability distributions underscored the need for recalibration before deployment in a new setting. Conclusions DKD risk detection in T2D is substantially improved by integrating complementary information layers under a fixed clinical screening policy, with gains arising from distinct domains that identify at-risk individuals in different clinical contexts. The layered landmark framework introduced here reveals how risk observability, shaped by monitoring intensity, healthcare engagement, and access, determines what a screening model can detect, and provides a foundation for context-aware EHR-based screening that accounts for data availability at the time of risk assessment.
[Graphical abstract: Study design and layered DKD screening framework. The top row defines the cohort timeline, in which predictors are derived from clinical data collected between T2D diagnosis and the 1-year landmark, and incident DKD is ascertained after the landmark. The second row depicts the nested model architecture, in which five successive models sequentially incorporate intrinsic risk, laboratory snapshot features, medication exposure, longitudinal care trajectories, and social determinants of health, while retaining all features from prior layers. The third row summarizes model development in the All of Us Research Program (N = 39,431) and external validation in the BioMe Biobank (N = 9,818), where the same trained models and risk thresholds were applied without retraining. The bottom row highlights the three evaluation domains: predictive performance, fixed-policy screening, and missed-case recovery context. DKD, diabetic kidney disease; T2D, type 2 diabetes; PRS, polygenic risk scores; AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; PPV, positive predictive value; SHAP, SHapley Additive exPlanations.]
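A fixed screening policy at 90% specificity amounts to choosing the risk threshold from the non-case score distribution and holding it constant as models grow. A quantile-based sketch, assuming scores where higher means riskier (the quantile method and toy scores are ours):

```python
def threshold_at_specificity(control_scores, specificity=0.90):
    """Pick the risk threshold as (approximately) the `specificity`
    quantile of scores among non-cases, so that flagging everyone above
    it keeps the false positive rate near 1 - specificity regardless of
    which model produced the scores. Simple empirical-quantile sketch."""
    ranked = sorted(control_scores)
    k = min(int(specificity * len(ranked)), len(ranked) - 1)
    return ranked[k]

controls = [i / 100 for i in range(100)]   # toy non-case risk scores
cut = threshold_at_specificity(controls)   # flag anyone scoring above this
print(cut)
```

Under this policy, any sensitivity gain from adding an information layer is a genuine gain: it cannot be bought by silently loosening the false positive rate.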

13
Individualized Forecasting of Headache Attack Risk Using a Continuously Updating Model

Houle, T. T.; Lebowitz, A.; Chtay, I.; Patel, T.; McGeary, D. D.; Turner, D. P.

2026-04-22 neurology 10.64898/2026.04.20.26350119 medRxiv
Top 0.3%
1.8%

Importance: Migraine attacks often occur unpredictably, limiting the ability of individuals to initiate timely preventive or preemptive treatment. Short-term probabilistic forecasting of migraine risk could enable more targeted management strategies. Objective: To externally validate the previously developed Headache Prediction Model (HAPRED-I), evaluate an updated continuously learning model (HAPRED-II), and assess the feasibility and short-term safety of delivering individualized probabilistic migraine forecasts directly to patients. Design, Setting, and Participants: Prospective 8-week cohort study conducted remotely at two academic medical centers in the United States (Massachusetts General Hospital and Wake Forest Health Sciences) between 2015 and 2019. Adults with recurrent migraine or tension-type headache completed twice-daily electronic diaries. A total of 230 participants contributed 23,335 diary entries across 11,862 participant-days of observation. Main Outcomes and Measures: Occurrence of a headache attack within 24 hours following each evening diary entry. Model performance was evaluated using discrimination (area under the receiver operating characteristic curve [AUC]) and calibration. Results: External validation of HAPRED-I demonstrated modest discrimination (AUC, 0.59; 95% CI, 0.57-0.61) and poor calibration, with predicted probabilities consistently exceeding observed headache risk. In contrast, the continuously updating HAPRED-II model demonstrated progressive improvement in predictive performance as participant-specific data accumulated. Discrimination increased from an AUC of 0.59 (95% CI, 0.57-0.61) during the first 14 days to 0.66 (95% CI, 0.63-0.70) after the first month, accompanied by improved calibration across predicted risk levels. Over the study period, 6,999 individualized forecasts were delivered directly to participants.
No evidence suggested that receipt of forecasts was associated with increasing headache frequency or worsening predicted headache risk trajectories. Conclusions and Relevance: A static migraine forecasting model demonstrated limited transportability to new individuals. In contrast, models that continuously update within individuals may improve predictive accuracy over time and enable real-time delivery of personalized migraine risk forecasts. Further work incorporating richer physiologic and contextual predictors will likely be necessary before such systems can reliably guide clinical treatment decisions.
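The continuously updating, within-person model described above can be sketched as prequential (predict-then-update) online logistic regression. This is an illustration of the general idea only, not the HAPRED-II implementation; the two predictors and all data below are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulate one participant's evening diary entries: two hypothetical
# predictors (e.g. stress, poor sleep) drive next-24h attack risk.
n = 200
X = rng.normal(size=(n, 2))
y = (rng.random(n) < sigmoid(1.5 * X[:, 0] - 1.0 * X[:, 1])).astype(int)

# Prequential loop: forecast tomorrow's risk, observe the outcome,
# then update the person-specific weights (online logistic regression).
w, b, lr = np.zeros(2), 0.0, 0.1
preds = np.empty(n)
for i in range(n):
    p = sigmoid(X[i] @ w + b)   # forecast issued before the outcome
    preds[i] = p
    g = p - y[i]                # gradient of the log loss
    w -= lr * g * X[i]
    b -= lr * g

# AUC on the second half (after a warm-up period), via the
# Mann-Whitney pairwise formulation.
pos = preds[100:][y[100:] == 1]
neg = preds[100:][y[100:] == 0]
auc = ((pos[:, None] > neg[None, :]).mean()
       + 0.5 * (pos[:, None] == neg[None, :]).mean())
```

As the individual's own diary accumulates, the weights adapt and discrimination typically improves, mirroring the within-person gains reported in the abstract.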

14
Improving Care by FAster risk-STratification through use of high sensitivity point-of-care troponin in patients presenting with possible acute coronary syndrome in the EmeRgency department (ICare-FASTER): a stepped-wedge cluster randomized trial

Than, M.; Pickering, J. W.; Joyce, L. R.; Buchan, V. A.; Florkowski, C. M.; Mills, N. L.; Hamill, L.; Prystowsky, J.; Harger, S.; Reed, M.; Bayless, J.; Feberwee, A.; Attenburrow, T.; Norman, T.; Welfare, O.; Heiden, T.; Kavsak, P.; Jaffe, A. S.; Apple, F.; Peacock, W. F.; Cullen, L.; Aldous, S.; Richards, A. M.; Lacey, C.; Troughton, R.; Frampton, C.; Body, R.; Mueller, C.; Lord, S. J.; George, P. M.; Devlin, G.

2026-04-23 cardiovascular medicine 10.64898/2026.04.21.26351433 medRxiv
Top 0.4%
1.7%

BACKGROUND Point-of-care (POC) high-sensitivity cardiac troponin (hs-cTn) testing has the potential to expedite decision-making and reduce emergency department (ED) length of stay for patients presenting with possible myocardial infarction (MI) by ensuring that results are consistently available when looked for by clinicians. We assessed the real-life effectiveness and safety of implementing POC hs-cTn testing in the ED. METHODS We conducted a pragmatic, stepped-wedge cluster randomized trial. The control arm was usual care with an accelerated diagnostic pathway utilizing a single-sample rule-out step with a central laboratory hs-cTn assay. The intervention arm used the same pathway with a POC hs-cTnI. The primary effectiveness outcome was ED length of stay assessed using a generalized linear mixed model, and the safety outcome was 30-day MI or cardiac death. RESULTS Six sites participated with 59,980 ED presentations (44,747 individuals, 61±19 years, 49.5% female) from February 2023 to January 2025, of which 31,392 presentations occurred during the intervention arm. After adjustment for covariates associated with length of stay, the intervention reduced length of stay by 13% (95% confidence interval [CI], 9 to 16%; P<0.001), corresponding to a reduction of 47 minutes (95% CI, 33 to 61 minutes) from a mean length of stay in the control arm of 376 minutes. The 30-day MI or cardiac death rate was similar in the control and intervention arms (0.39% and 0.39% respectively, P=0.54). CONCLUSIONS Implementation of whole-blood hs-cTnI testing at the POC into an accelerated diagnostic pathway was safe and reduced length of stay in the ED compared with laboratory testing.

15
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv
Top 0.4%
1.7%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence was not associated with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail.
Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.

16
Research Paper on AuditMed: A Single-File, Browser-Based Clinical Evidence Audit Platform Architecture, Current Capabilities, and Proposed Applications in Drug Informatics and Pharmacy Education

Ferguson, D. J.

2026-04-20 health informatics 10.64898/2026.04.19.26351188 medRxiv
Top 0.4%
1.7%

Background: Clinical pharmacists, trainees, and educators rely on multi-database literature retrieval and structured evidence synthesis to answer drug-information questions. Existing workflows require navigation across PubMed, DailyMed, LactMed, interaction checkers, and specialty guideline repositories with manual de-duplication, appraisal, and synthesis. Commercial platforms that integrate these functions are costly and often unavailable in community, rural, and international training contexts. Objective: This report describes the architecture of AuditMed, a single-file, browser-based clinical evidence audit platform, and reports preliminary stress-test results against a complex multi-morbidity case corpus. AuditMed is intended for research and educational use and is not a substitute for clinical judgment or validated commercial clinical decision-support systems. Methods: AuditMed integrates nineteen free, publicly available clinical and biomedical application programming interfaces into a six-stage Search → Select → Parse → Analyze → Infer → Create pipeline and supports browser-local patient-case ingestion with regex-based HIPAA Safe Harbor de-identification. Preliminary stress-testing was conducted against eleven cases (Cases 30 through 40) from the Complex Clinical Case Compendium Software Validation Suite, each featuring over twenty concurrent active disease states. For each case, the one-click inference pipeline was executed with default settings and the full Clinical Inference Report was captured verbatim. No retrieval-sensitivity, synthesis-fidelity, or time-to-answer endpoints were pre-specified; the exercise was qualitative and oriented toward pipeline behavior under extreme multi-morbidity. Results: The pipeline completed without fatal errors for all eleven cases and produced a structured Clinical Inference Report in each instance.
Quantitative-finding detection performed as designed for hematologic parameters and cardiac biomarkers. Two parser defects were identified and are reproduced in the appendix: an age-as-fever regex-precedence defect affecting seven cases and a diagnosis-versus-medication parsing defect affecting one case. Evidence-linkage rate varied from zero evidence-linked statements in seven cases to eleven in one case, reflecting dependence of the inference layer on MeSH-indexed literature coverage of the specific case diagnoses. Conclusions: AuditMed is an early-stage, open-source platform whose value at this stage lies in providing a free, transparent, auditable workflow for multi-source evidence synthesis with explicit uncertainty flagging. The preliminary results document both robust end-to-end completion under extreme case complexity and specific, reproducible parser defects that will be addressed before formal evaluation. Planned evaluation studies are described.
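The regex-based HIPAA Safe Harbor de-identification step mentioned in the abstract can be sketched as a pattern-to-placeholder rewrite. This is a minimal illustration, not AuditMed's actual rule set; the five patterns and placeholder tags below are assumptions, and a real de-identifier must cover all 18 HIPAA identifier categories:

```python
import re

# Illustrative Safe Harbor scrubber: each rule pairs a regex with a
# placeholder tag, applied in order over the free text.
RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b(MRN|Medical Record Number)[:# ]*\d+\b", re.I), "[MRN]"),
]

def deidentify(text: str) -> str:
    # Replace every match of every rule with its placeholder tag.
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    return text

note = ("Pt seen 03/14/2024, MRN: 8675309. "
        "Call 555-867-5309 or email jane.doe@example.org.")
clean = deidentify(note)
# clean == "Pt seen [DATE], [MRN]. Call [PHONE] or email [EMAIL]."
```

Rule order matters: more specific patterns (e.g. SSN) should run before broader digit patterns so each span is tagged by the rule that best describes it.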

17
MIMIC-IV-Phenotype-Atlas (MIPA): A Publicly Available Dataset for EHR Phenotyping

Yamga, E.; Goudrar, R.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350888 medRxiv
Top 0.4%
1.7%

Introduction: Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping: the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Methods: We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods on the task: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs. Results: The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines.
Conclusion: MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.
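A keyword-driven TF-IDF baseline of the kind benchmarked above can be sketched in a few lines. This is an illustrative toy, not the MIPA pipeline; the discharge-summary snippets and labels below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy discharge-summary snippets for one phenotype
# (heart failure): 1 = phenotype present, 0 = absent.
docs = [
    "admitted with dyspnea, reduced ejection fraction, started furosemide",
    "acute decompensated heart failure, elevated BNP, diuresed",
    "volume overload and orthopnea consistent with heart failure",
    "chronic systolic heart failure exacerbation, IV diuretics given",
    "elective knee arthroplasty, uneventful postoperative course",
    "community acquired pneumonia treated with ceftriaxone",
    "appendectomy performed, discharged in stable condition",
    "migraine without aura, responded to sumatriptan",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Keyword-driven baseline: unigram + bigram TF-IDF features
# feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(docs, labels)

pred = model.predict([
    "worsening heart failure with low ejection fraction, diuretics started",
    "elective laparoscopic cholecystectomy, uneventful postoperative course",
])
```

Such baselines pick up lexical cues ("heart failure", "ejection fraction") but, unlike LLMs, cannot interpret context such as negation or family history, which is consistent with the benchmark gap reported above.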

18
Built environment characteristics and drowning mortality: A global satellite-based analysis of urbanisation, infrastructure, and water proximity

Essex, R.; Lim, S.; Jagnoor, J.

2026-04-21 public and global health 10.64898/2026.04.19.26351236 medRxiv
Top 0.4%
1.7%

Drowning remains a major global public health challenge, yet how built environment characteristics shape population-level drowning risk remains poorly understood. This study linked satellite-derived built environment data to subnational drowning mortality estimates across 203 regions in 12 countries from 2006-2021. It found that built environment associations with drowning mortality are complex, non-linear, and shaped by development context. Urban extent was strongly protective, while built area near water showed protection overall but increased risk when combined with high population crowding. Almost all drowning mortality variance occurred between regions rather than within regions over time, indicating risk is predominantly determined by place-based characteristics. Income-stratified analyses revealed profound heterogeneity: crowding was protective in low- to middle-income settings but near-null in high-income regions, while waterfront development captured very different realities across contexts. These findings highlight the importance of tailoring drowning prevention strategies to local built environment configurations and development contexts.

19
TernTables: A Statistical Analysis and Table Generation Web Interface for Clinical and Biomedical Research

Preston, J. D.; Abadiotakis, H.; Tang, A.; Rust, C. J.; Halkos, M. E.; Daneshmand, M. A.; Chan, J. L.

2026-04-20 bioinformatics 10.64898/2026.04.15.717241 medRxiv
Top 0.4%
1.7%

Clinical research dissemination is frequently hindered by administrative friction and methodological inconsistency. To address these barriers, we developed TernTables, a freely available, open-source web application (https://www.tern-tables.com/) and R package (https://cran.r-project.org/package=TernTables) that streamlines the transition from raw data to formatted results for descriptive and univariate clinical reporting. The system integrates a client-side screening protocol for protected health information (PHI) with a rule-based decision tree that selects and executes appropriate frequency-based, parametric, or non-parametric statistical tests based on data distribution and class. TernTables generates publication-ready summary tables in Microsoft Word format, complemented by dynamically generated methods text and the underlying R code to ensure complete transparency and reproducibility. Validation using a landmark clinical trial dataset demonstrated concordance with established biostatistical approaches for descriptive and univariate analyses. TernTables is designed to supplement, not replace, formal statistical consultation by standardizing routine descriptive and univariate workflows, allowing biostatistical expertise to be focused on complex analyses and study design. By lowering technical and financial barriers, the platform democratizes access to rigorous statistical workflows while maintaining methodological excellence and reducing "researcher degrees of freedom."
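The rule-based decision tree for test selection can be illustrated with a toy version. This is an invented sketch, not TernTables' actual rules; the `choose_test` helper, its normality check, and the alpha threshold are assumptions for illustration:

```python
import numpy as np
from scipy import stats

def choose_test(x, y, alpha=0.05):
    """Pick a two-group univariate test from data class and distribution.

    Toy rule tree: categorical data -> chi-squared test of frequencies;
    numeric data -> Shapiro-Wilk normality check per group, then
    Welch's t-test if both groups look normal, else Mann-Whitney U.
    """
    x, y = np.asarray(x), np.asarray(y)
    if x.dtype.kind not in "fi":  # non-numeric -> frequency-based test
        cats = np.union1d(x, y)
        table = [[(x == c).sum() for c in cats],
                 [(y == c).sum() for c in cats]]
        return "chi-squared", stats.chi2_contingency(table)[1]
    normal = (stats.shapiro(x).pvalue > alpha
              and stats.shapiro(y).pvalue > alpha)
    if normal:
        return "welch-t", stats.ttest_ind(x, y, equal_var=False).pvalue
    return "mann-whitney", stats.mannwhitneyu(x, y).pvalue

rng = np.random.default_rng(1)
name_norm, _ = choose_test(rng.normal(0, 1, 50), rng.normal(0.5, 1, 50))
name_skew, _ = choose_test(rng.exponential(1, 50), rng.exponential(2, 50))
name_cat, _ = choose_test(np.array(["a", "b", "a"] * 10),
                          np.array(["b", "b", "a"] * 10))
```

Encoding the selection logic as an explicit, auditable function is the design idea: the chosen test can be reported alongside the result, which supports the reproducibility goals described above.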

20
Patient perspectives on living with hypertension: Social media listening analysis across predominantly high-income countries

Di Somma, S.; Gervais, R.; Bains, M.; Carter-Williams, S.; Messner, S.; Onsongo, N.

2026-04-23 cardiovascular medicine 10.64898/2026.04.22.26351483 medRxiv
Top 0.5%
1.5%

Background: Chronic conditions such as hypertension can significantly disrupt daily life and emotional wellbeing. The interaction between patients' perceptions, adherence to antihypertensive medication and quality of life (QoL) remains underexplored outside structured clinical settings. Objectives: To capture unprompted patient perspectives, assess whether hypertension affects QoL, and investigate whether patient-reported experiences are associated with self-reported antihypertensive medication adherence. Methods: Social media listening (SML) study analyzing 86,368 anonymized posts from individuals with hypertension in 12 countries, collected between January 2022 and May 2024. Posts from 11 countries (n=81,368) were analyzed using artificial intelligence-enabled natural language processing. Posts from China (n=5,000) were analyzed separately using a harmonized framework. Quantitative and qualitative methods assessed variations by country, age, and gender, and associations between emotional expression and antihypertensive medication adherence. Results: Across the 11-country core sample, 45% of posts mentioned at least one QoL impact, most commonly worry/anxiety (11%). Impacts varied across countries. Among 8,096 posts with age identified, individuals <40 years reported emotional balance impacts in 28% of posts versus 22% among those aged 40+. Work/education impacts were mentioned in 17% of posts by those <40 years vs 12% among those aged 40+. Among 7,968 posts explicitly referencing adherence, expressed worry was associated with stricter adherence (62% association score), as were structured routines (79% score), home monitoring (77%), dietary changes (77%), and exercise (71%). In contrast, sadness/depression was associated with inconsistent adherence (71%), as were forgetfulness (79%), side effects (73%), and cost/insurance concerns (65%).
Conclusions: These results emphasize the importance of the psychological and emotional impact of hypertension, including on adherence to medication regimens, reinforcing the value of a holistic approach to patient care.